library("here")
library("dplyr")
library("ggplot2")
library("ade4")
library("kableExtra")
library("SummarizedExperiment")
library("phyloseq")
library("tidyverse")
library("GGally")
library("pheatmap")
library("factoextra")
library("MASS")
library("lattice")
library("ggcorrplot")
library("ggfortify")
library("ggrepel")
library("pracma")
library("dslabs")
library("ggridges")

Chapter Overview

By measuring and analyzing several variables on the same subjects, we are able to identify valuable patterns, dependencies, connections or associations between the different variables.

Such analyses are made easy in R, as the data are usually represented as matrices.

However, a major challenge in biodata analysis is that the data come in a huge number of dimensions. Thus the need to balance dimensionality reduction against information loss.

This Chapter focuses on Principal Component Analysis (PCA) as a dimension reduction method, using basic data but also data from high-throughput experiments. Geometric explanations of PCA as well as visualizations that help interpret the output of PCA analyses are presented.

7.1 Goals for this Chapter

7.2 What are the data? Matrices and their motivation

Data used in this Chapter are:

-Turtles: a matrix of three biometric measurements (length, width, height) on painted turtles.

turtles = read.table(here("Book","data","PaintedTurtles.txt"), header = TRUE)
dim(turtles)
## [1] 48  4
turtles[1:4, ]
##   sex length width height
## 1   f     98    81     38
## 2   f    103    84     38
## 3   f    103    86     42
## 4   f    105    86     40
load(here("Book","data","athletes.RData"))
athletes[1:3, ]
##    m100 long weight highj  m400  m110  disc pole javel  m1500
## 1 11.25 7.43  15.48  2.27 48.90 15.13 49.28  4.7 61.32 268.95
## 2 10.87 7.45  14.97  1.97 47.71 14.46 44.36  5.1 61.76 273.02
## 3 11.18 7.44  14.20  1.97 48.29 14.81 43.66  5.2 64.16 263.20
load(here("Book","data","Msig3transp.RData"))
round(Msig3transp,2)[1:5, 1:6]
##              X3968 X14831 X13492 X5108 X16348  X585
## HEA26_EFFE_1 -2.61  -1.19  -0.06 -0.15   0.52 -0.02
## HEA26_MEM_1  -2.26  -0.47   0.28  0.54  -0.37  0.11
## HEA26_NAI_1  -0.27   0.82   0.81  0.72  -0.90  0.75
## MEL36_EFFE_1 -2.24  -1.08  -0.24 -0.18   0.64  0.01
## MEL36_MEM_1  -2.68  -0.15   0.25  0.95  -0.20  0.17
data("GlobalPatterns", package = "phyloseq")
GPOTUs = as.matrix(t(phyloseq::otu_table(GlobalPatterns)))
GPOTUs[1:4, 6:13]
## OTU Table:          [8 taxa and 4 samples]
##                      taxa are columns
##         246140 143239 244960 255340 144887 141782 215972 31759
## CL3          0      7      0    153      3      9      0     0
## CC1          0      1      0    194      5     35      3     1
## SV1          0      0      0      0      0      0      0     0
## M31Fcsw      0      0      0      0      0      0      0     0
library("SummarizedExperiment")
data("airway", package = "airway")
assay(airway)[1:3, 1:4]
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        515        621        365
metab = t(as.matrix(read.csv(here("Book","data","metabolites.csv"), row.names = 1)))
metab[1:4, 1:4]
##          146.0985388 148.7053275 310.1505057 132.4512963
## KOGCHUM1    29932.36    17055.70     1132.82    785.5129
## KOGCHUM2    94067.61    74631.69    28240.85   5232.0499
## KOGCHUM3   146411.33   147788.71    64950.49  10283.0037
## WTGCHUM1   229912.57   384932.56   220730.39  26115.2007

Task

Tabulate the frequencies of zeros in the airway, GPOTUs and metab data matrices.

Data     Frequency of zeros
airway   314674
GPOTUs   395038
metab    604

These frequencies reflect the sparsity that exists in the datasets.
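As a sketch of how such counts can be obtained, here is the base R idiom on a toy matrix (m is a made-up stand-in for the real data matrices):

```r
# Toy stand-in for one of the data matrices
m <- matrix(c(0, 3, 0, 7,
              0, 0, 2, 5), nrow = 2, byrow = TRUE)

# Absolute count of zeros, and the proportion (a measure of sparsity)
n_zero <- sum(m == 0)
p_zero <- mean(m == 0)
n_zero  # 4
p_zero  # 0.5
```

On the real matrices the same idiom applies, e.g. sum(assay(airway) == 0).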

► Question 7.1

  1. What are the columns of these data matrices usually called?

    The columns are the variables.

  2. In each of these examples, what are the rows of the matrix?

    The rows are the subjects, samples or objects being studied, except in the RNA-Seq example, where the roles are swapped: the rows are the genes (variables) and the columns are the samples.

  3. What does a cell in a matrix represent?

    A record or measurement of one variable for a specific subject.

  4. If the data matrix is called athletes and you want to see the value of the third variable for the fifth athlete, what do you type into R?

    athletes[5, 3]

7.2.1 Low-dimensional data summaries and preparation

Analyzing just one column (one-dimensional data) is a univariate analysis. A one-number summary (mean or median) of such a column is a zero-dimensional summary.

A correlation coefficient is an example of a measure used when considering two variables measured together on the same set of observations.

► Question 7.2

Compute the matrix of all correlations between the measurements from the turtles data. What do you notice?

cor(turtles[, -1])
##           length     width    height
## length 1.0000000 0.9783116 0.9646946
## width  0.9783116 1.0000000 0.9605705
## height 0.9646946 0.9605705 1.0000000

► Question 7.3

Using GGally, produce all pairwise scatterplots, as well as the one-dimensional histograms on the diagonal, for the turtles data. Guess the underlying or “true dimension” of these data?

ggpairs(turtles[, -1], axisLabels = "none")

The variables are all strongly positively correlated, and all of them increase with the overall size of the turtles.

► Question 7.4

Make a pairs plot of the athletes data. What do you notice?

ggpairs(athletes, axisLabels = "none")

Using ggpairs reveals subtle correlations, which can be viewed more clearly with pheatmap.

pheatmap(cor(athletes), cell.width = 10, cell.height = 10)

The variables cluster into three groups: running, throwing and jumping.

7.2.2 Preprocessing the data

Data preprocessing is essential, especially when variables with different baselines and scales are involved.

Data transformation is often relevant.

Examples of transformations are:

Centering: subtracting the mean, so that the mean of the centered data is at the origin.

Scaling or standardizing: dividing by the standard deviation, so that the new standard deviation is \(1\). Unlike the VST, the aim is to make the scale (as measured by mean and standard deviation) of different variables the same
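Centering and scaling can be checked by hand; a minimal sketch on a toy vector (the numbers are arbitrary):

```r
x <- c(2, 4, 6, 8)

# scale() centers (subtracts the mean) and standardizes (divides by the sd)
s <- scale(x)

# The same result computed explicitly
manual <- (x - mean(x)) / sd(x)
stopifnot(all.equal(as.numeric(s), manual))

mean(s)            # 0 (up to numerical precision)
sd(as.numeric(s))  # 1
```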

► Question 7.5

Compute the means and standard deviations of the turtle data, then use the scale function to center and standardize the continuous variables. Call this scaledTurtles, then verify the new values for mean and standard deviation of scaledTurtles. Make a scatterplot of the scaled and centered width and height variables of the turtle data and color the points by their sex.

#find standard deviation
apply(turtles[,-1], 2, sd)
##    length     width    height 
## 20.481602 12.675838  8.392837
#find mean
apply(turtles[,-1], 2, mean)
##    length     width    height 
## 124.68750  95.43750  46.33333
#scale and centre the data then compute mean and sd
scaledTurtles = scale(turtles[, -1])
apply(scaledTurtles, 2, mean)
##        length         width        height 
## -1.432050e-18  1.940383e-17 -2.870967e-16
apply(scaledTurtles, 2, sd)
## length  width height 
##      1      1      1
#convert the scaled data into a dataframe and plot
data.frame(scaledTurtles, sex = turtles[, 1]) %>%
  ggplot(aes(x = width, y = height, group = sex)) +
    geom_point(aes(color = sex)) + coord_fixed()

7.3 Dimension reduction (DR)

DR was invented by Karl Pearson to reduce a two-variable scatterplot to a single coordinate.

It was subsequently used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects.

PCA is a widely used exploratory technique for DR via multivariate analysis.

It is an example of a geometrical projection of points from higher-dimensional spaces onto lower dimensions by a vector \(v\).

PCA is called an unsupervised learning technique because, as in clustering, it treats all variables as having the same status.

The goal is to find a linear mathematical model for an underlying structure for all the variables.

7.3.1 Lower-dimensional projections

As an example, let's generate a scatterplot of two variables (weight and disc) of the athletes data, showing the projection onto the horizontal \(x\)-axis (defined by \(y = 0\)).

athletes = data.frame(scale(athletes))
ath_gg = ggplot(athletes, aes(x = weight, y = disc)) +
  geom_point(size = 2, shape = 21)
ath_gg + geom_point(aes(y = 0), colour = "red") +
  geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")

► Task

  1. Calculate the variance of the red points in Figure 7.6.

  2. Make a plot showing projection lines onto the \(y-axis\) and projected points.

  3. Compute the variance of the points projected onto the vertical \(y-axis\).

#a. Variance of projected weights
var(data.frame(scale(athletes))$weight)
## [1] 1
#b. Projection onto the y axis
ath_gg + geom_point(aes(x = 0), colour = "purple") +
  geom_segment(aes(xend = 0, yend = disc), linetype = "dashed")

#Variance of projected discs

var(data.frame(scale(athletes))$disc)
## [1] 1

7.3.2 How do we summarize two-dimensional data by a line?

-Summarizing/projecting data from two dimensions onto a line can easily lead to loss of valuable information about the variables. Hence the need for effective approaches.

-Regression lines are good for such projections

Regression of the disc variable on weight can be done with the lm function:

reg1 = lm(disc ~ weight, data = athletes)
a1 = reg1$coefficients[1] # intercept
b1 = reg1$coefficients[2] # slope
pline1 = ath_gg + geom_abline(intercept = a1, slope = b1,
    col = "blue", lwd = 1.5)
pline1 + geom_segment(aes(xend = weight, yend = reg1$fitted),
    colour = "red", arrow = arrow(length = unit(0.15, "cm")))

Conversely, a regression of weight on discus is obtained by

reg2 = lm(weight ~ disc, data = athletes)
a2 = reg2$coefficients[1] # intercept
b2 = reg2$coefficients[2] # slope
pline2 = ath_gg + geom_abline(intercept = -a2/b2, slope = 1/b2,
    col = "darkgreen", lwd = 1.5)
pline2 + geom_segment(aes(xend=reg2$fitted, yend=disc),
    colour = "orange", arrow = arrow(length = unit(0.15, "cm")))

The relationship (i.e., the slope and intercept) differs depending on which of the variables we choose as the predictor and which as the response.
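This asymmetry can be made precise: the product of the two slopes equals the squared correlation coefficient, so the two regression lines coincide only when \(|r| = 1\). A small check on simulated data (the seed, sample size and coefficients are arbitrary):

```r
set.seed(1)
x <- rnorm(50)
y <- 0.7 * x + rnorm(50, sd = 0.5)

b1 <- coef(lm(y ~ x))[2]  # slope of y regressed on x
b2 <- coef(lm(x ~ y))[2]  # slope of x regressed on y

# b1 * b2 = r^2, so b2 is not simply 1 / b1 unless the correlation is perfect
stopifnot(isTRUE(all.equal(unname(b1 * b2), cor(x, y)^2)))
```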

► Question 7.6

How large is the variance of the projected points that lie on the blue regression line of Figure 7.7? Compare this to the variance of the data when projected on the original axes, weight and disc.

# variances of the points along the original 
var(athletes$disc)
## [1] 1
var(athletes$weight)
## [1] 1
# variance of lm fitted data
var(reg2$fitted)
## [1] 0.6502039
var(reg1$fitted)
## [1] 0.6502039

A line that minimizes distances in both directions

The goal is to minimize the sum of squares of the orthogonal (perpendicular) projections of data points onto the line.

This line is called the principal component line

Notice that the svd (singular value decomposition) function is used here. The line can be obtained by:

xy = cbind(athletes$disc, athletes$weight)
svda = svd(xy)
pc = xy %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2]) - bp * mean(pc[, 1])
ath_gg + geom_segment(xend = pc[, 1], yend = pc[, 2]) +
  geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5)

Plotting all the lines:

The blue line minimizes the sum of squares of the vertical residuals, the green line minimizes the horizontal residuals, and the purple line, called the principal component, minimizes the orthogonal projections. Notice the ordering of the slopes of the three lines.

► Question 7.7

What is particular about the slope of the purple line? Redo the plots on the original (unscaled) variables. What happens?

The purple line has a slope of 1.

#Disc onto Weight
load(here("Book","data","athletes.RData"))
d_w = ggplot(athletes, aes(x = weight, y = disc)) +
  geom_point(size = 2, shape = 21)
d_w + geom_point(aes(y = 0), colour = "red") +
  geom_segment(aes(xend = weight, yend = 0), linetype = "dashed")

#Weight onto disc
w_d = ggplot(athletes, aes(x = weight, y = disc)) +
  geom_point(size = 2, shape = 21)
w_d + geom_point(aes(x = 0), colour = "red") +
  geom_segment(aes(yend = disc, xend = 0), linetype = "dashed")

► Question 7.8

Compute the variance of the points on the purple line.

apply(pc, 2, var)
## [1] 0.9031761 0.9031761
sum(apply(pc, 2, var))
## [1] 1.806352

7.4 The new linear combinations

Principal components are linear combinations of the variables that were originally measured, they provide a new coordinate system.

The result is a new variable \(V\), and the coefficients of the linear combination are called the loadings.
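In code, the new variable is just the matrix product of the data with the vector of loadings; a sketch with prcomp on toy numbers (made up for illustration):

```r
# Two toy variables, centered
X <- scale(cbind(a = c(1, 2, 3, 4), b = c(2, 3, 5, 8)), scale = FALSE)

pc <- prcomp(X)
loadings <- pc$rotation[, 1]   # coefficients of the first principal component

# The first PC is this linear combination of the original columns
pc1_by_hand <- as.numeric(X %*% loadings)
stopifnot(all.equal(pc1_by_hand, as.numeric(pc$x[, 1])))
```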

7.4.1 Optimal lines

Principal component minimizes the distance to the line, and it also maximizes the variance of the projections along the line.
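The variance-maximizing property can be checked empirically: no direction should give projections with larger variance than the first singular vector. A sketch on simulated correlated data (all numbers arbitrary):

```r
set.seed(2)
X <- scale(matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0.8, 1), 2))

v1 <- svd(X)$v[, 1]                  # direction of the first principal component
var_pc <- var(as.numeric(X %*% v1))  # variance of projections onto v1

# Compare against projections onto 100 random unit directions
var_rand <- replicate(100, {
  u <- rnorm(2); u <- u / sqrt(sum(u^2))
  var(as.numeric(X %*% u))
})
stopifnot(var_pc + 1e-8 >= max(var_rand))
```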

7.5 The PCA workflow

PCA is based on the principle of finding the axis showing the largest inertia/variability, removing the variability in that direction, and then iterating to find the next best orthogonal axis, and so on.

All the axes can be found in one operation, called the Singular Value Decomposition.

The workflow;

  1. the means and variances are computed, and the choice of whether to work with rescaled covariances –the correlation matrix– or not has to be made

  2. the choice of \(k\), the number of components relevant to the data, is made; that is, the rank of the approximation we choose.

  3. The end results of the PCA workflow are useful maps of both the variables and the samples

7.6 The inner workings of PCA: rank reduction

t1 <- matrix(c('X',1:4,2,' ',' ',' ',' ',4,' ',' ',' ',' ',8,' ',' ',' ',' '), ncol = 4)
t2 <- t1; t2[2:5,2] <- as.numeric(t2[2:5,1]) * 2
t3 <- t2; t3[2:5,3] <- as.numeric(t3[2:5,2]) * 2
t4 <- t3; t4[2:5,4] <- as.numeric(t4[2:5,3]) * 2

tt1 <- knitr::kable(t1, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ', ' ', ' ')) %>%
  column_spec(1, border_right = T) %>%
  row_spec(1, extra_css = 'border-bottom:1px solid black;', bold = FALSE) %>%
  row_spec(0, extra_css = 'border-bottom: 0;', bold = FALSE) %>%
  add_header_above(header = c('Step 0' = 4))
tt2 <- knitr::kable(t2, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ', ' ', ' ')) %>%
  column_spec(1, border_right = T) %>%
  row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
  row_spec(0, extra_css = 'border-bottom: 0;', bold = FALSE) %>%
  add_header_above(header = c('Step 1' = 4))
tt3 <- knitr::kable(t3, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ', ' ', ' ')) %>%
  column_spec(1, border_right = T) %>%
  row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
  row_spec(0, extra_css = 'border-bottom: 0;', bold = FALSE) %>%
  add_header_above(header = c('Step 2' = 4))
tt4 <- knitr::kable(t4, 'html', table.attr = 'class = "console"', align = 'lrrr', col.names = c(' ', ' ', ' ', ' ')) %>%
  column_spec(1, border_right = T) %>%
  row_spec(1, extra_css = 'border-bottom:1px solid black;') %>%
  row_spec(0, extra_css = 'border-bottom: 0;', bold = FALSE) %>%
  add_header_above(header = c('Step 3' = 4))

knitr::kable(data.frame(as.character(tt1), as.character(tt2), as.character(tt3), as.character(tt4)),
  col.names = NULL,
  format = 'html',
  escape = FALSE,
  table.attr = 'class = "kable_wrapper"')
Step 0
X | 2  4  8
1 |
2 |
3 |
4 |

Step 1
X | 2  4  8
1 | 2
2 | 4
3 | 6
4 | 8

Step 2
X | 2  4  8
1 | 2  4
2 | 4  8
3 | 6 12
4 | 8 16

Step 3
X | 2  4  8
1 | 2  4  8
2 | 4  8 16
3 | 6 12 24
4 | 8 16 32

\(X\) has 12 elements, while in terms of \(u\) and \(v\) it can be expressed with only 7 numbers. Thus writing \(X = u v^t\) is more economical than spelling out the full matrix \(X\).
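The four-step table above is just the outer product of the first column with the header row; a sketch in R reproducing the numbers of the toy table:

```r
u <- 1:4
v <- c(2, 4, 8)

X <- u %*% t(v)   # the 4 x 3 table filled in at Step 3

# 12 entries in X, but only 4 + 3 = 7 numbers in u and v
stopifnot(length(X) == 12, all(X == outer(u, v)))
X[3, 2]  # 12, i.e. u[3] * v[2]
```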

In reality, our goal is rather to reverse the process: given \(X\), find the decomposed components.

Since the decomposition is not unique (there are several candidate choices for the vectors \(u\) and \(v\)), we choose the decomposition satisfying orthonormality.

par(mfrow=c(1,2))
knitr::include_graphics(c(here("Book","image_files","images","SVD-mosaicXplot1.png"),here("Book","image_files","images","SVD-mosaicXplot2.png"),here("Book","image_files","images","SVD-mosaicXplot3.png")))

X = matrix(c(780,  75, 540,
             936,  90, 648,
            1300, 125, 900,
             728,  70, 504), nrow = 3)
u = c(0.8196, 0.0788, 0.5674)
v = c(0.4053, 0.4863, 0.6754, 0.3782)
s1 = 2348.2
sum(u^2)
## [1] 0.9998964
sum(v^2)
## [1] 0.9999562
s1 * u %*% t(v)
##           [,1]      [,2]      [,3]      [,4]
## [1,] 780.03419 935.92555 1299.8645 727.87794
## [2,]  74.99597  89.98406  124.9748  69.98143
## [3,] 540.00903 647.93089  899.8818 503.90183
X - s1 * u %*% t(v)
##              [,1]       [,2]       [,3]       [,4]
## [1,] -0.034187016 0.07445066 0.13548011 0.12205890
## [2,]  0.004033752 0.01594279 0.02522674 0.01856789
## [3,] -0.009026004 0.06911092 0.11819353 0.09816522

► Question

Try svd(X) in R. Look at the components of the output of the svd function carefully. Check the norm of the columns of the matrices that result from this call. Where did the above value of s1 = 2348.2 come from?

svd(X)$u[, 1]
## [1] 0.81963482 0.07881104 0.56743949
svd(X)$v[, 1]
## [1] 0.4052574 0.4863089 0.6754290 0.3782403
sum(svd(X)$u[, 1]^2)
## [1] 1
sum(svd(X)$v[, 1]^2)
## [1] 1
svd(X)$d
## [1] 2.348244e+03 2.141733e-13 6.912584e-15

We see that the second and third singular values are 0 (up to the numerical precision we care about). That is why we say that X is of rank 1.
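The same criterion gives a practical way to determine numerical rank: count the singular values above a (necessarily arbitrary) tolerance. A sketch on a made-up rank-2 matrix:

```r
# Sum of two non-proportional rank-1 pieces, so the rank is 2
M <- outer(1:4, 1:3) + outer(c(1, 0, 2, 1), c(3, 1, 0))

d <- svd(M)$d
rank_M <- sum(d > 1e-8 * d[1])  # singular values above a relative tolerance
rank_M  # 2
```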

7.6.2 How do we find such a decomposition in a unique way?

In the example above, the decomposition has three elements: the horizontal and vertical singular vectors, and the diagonal corner, called the singular value.

These can be found using the singular value decomposition function (\(svd\))

Xtwo = matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,
       18, 28, 52, 36, 18, 10.5, 64.5, 36), ncol = 4, byrow = TRUE)
USV = svd(Xtwo)

► Question

Check how each successive pair of singular vectors improves our approximation to Xtwo. What do you notice about the third and fourth singular values?

Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1])
##             [,1]       [,2]        [,3]        [,4]
## [1,]  0.87481760  19.045230 -10.1044650  1.74963521
## [2,]  0.08079747   1.759002  -0.9332405  0.16159494
## [3,] -0.04700978  -1.023427   0.5429803 -0.09401956
## [4,]  0.16159494   3.518005  -1.8664809  0.32318987
## [5,] -0.69632883 -15.159437   8.0428540 -1.39265765
Xtwo - USV$d[1] * USV$u[, 1] %*% t(USV$v[, 1]) -
       USV$d[2] * USV$u[, 2] %*% t(USV$v[, 2])
##              [,1]          [,2]         [,3]         [,4]
## [1,] 7.216450e-15 -1.065814e-14 8.881784e-15 4.884981e-15
## [2,] 2.040035e-15 -5.995204e-15 1.054712e-14 3.219647e-15
## [3,] 2.865763e-15 -9.547918e-15 1.554312e-15 6.231127e-15
## [4,] 4.385381e-15 -5.773160e-15 1.776357e-14 7.049916e-15
## [5,] 5.107026e-15 -1.776357e-15 1.776357e-14 1.776357e-14

The third and fourth singular values are so small that they do not improve the approximation; we can conclude that Xtwo is of rank 2.

► Task

Check the orthonormality by computing the cross product of the \(U\) and \(V\) matrices:

t(USV$u) %*% USV$u
##               [,1]          [,2]          [,3]          [,4]
## [1,]  1.000000e+00 -1.665335e-16  0.000000e+00 -8.326673e-17
## [2,] -1.665335e-16  1.000000e+00  1.665335e-16 -5.551115e-17
## [3,]  0.000000e+00  1.665335e-16  1.000000e+00 -5.551115e-17
## [4,] -8.326673e-17 -5.551115e-17 -5.551115e-17  1.000000e+00
t(USV$v) %*% USV$v
##               [,1]          [,2]          [,3]          [,4]
## [1,]  1.000000e+00  8.326673e-17  1.387779e-17 -5.551115e-17
## [2,]  8.326673e-17  1.000000e+00 -3.642919e-17 -6.938894e-17
## [3,]  1.387779e-17 -3.642919e-17  1.000000e+00  2.775558e-17
## [4,] -5.551115e-17 -6.938894e-17  2.775558e-17  1.000000e+00

Execute svd on the rescaled turtles matrix:

turtles.svd = svd(scaledTurtles)
turtles.svd$d
## [1] 11.746475  1.419035  1.003329
dim(turtles.svd$u)
## [1] 48  3

► Question

What can you conclude about the turtles matrix from the svd output?

The first singular value is much larger than the other two, so the scaled turtles data are close to rank 1. Moreover, the first singular vector has (nearly) equal coefficients for the three variables, so the dominant component is an overall size measure.

7.6.3 Singular value decomposition

The Singular Value Decomposition is

\(X = USV^t, \quad V^tV = I, \quad U^tU = I\)

where \(S\) is the diagonal matrix of singular values, \(V^t\) is the transpose of \(V\), and \(I\) is the identity matrix.
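These identities can be verified directly with svd in R; a minimal sketch on an arbitrary 2 x 3 matrix:

```r
A <- matrix(c(3, 1, 4, 1, 5, 9), nrow = 2)
s <- svd(A)

# X = U S V^t, with orthonormal columns in U and V
stopifnot(
  isTRUE(all.equal(A, s$u %*% diag(s$d) %*% t(s$v))),
  isTRUE(all.equal(t(s$u) %*% s$u, diag(2))),
  isTRUE(all.equal(t(s$v) %*% s$v, diag(2)))
)
```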

7.6.4 Principal components

The principal component transformation is defined so that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components
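Both properties (decreasing variances and orthogonality of the components) can be verified on any dataset; a sketch with prcomp on simulated numbers:

```r
set.seed(3)
X <- matrix(rnorm(150), ncol = 3)
pc <- prcomp(X)

v <- apply(pc$x, 2, var)   # variances of the successive components
stopifnot(
  all(diff(v) <= 1e-12),                    # non-increasing variances
  abs(cor(pc$x[, 1], pc$x[, 2])) < 1e-8,    # orthogonal (uncorrelated) scores
  isTRUE(all.equal(unname(v), pc$sdev^2))   # the variances are sdev^2
)
```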

► Question

Compute the first principal component for the turtles data by multiplying the first singular value turtles.svd$d[1] by turtles.svd$u[,1]. What is another way of computing it?

turtles.svd$d[1] %*% turtles.svd$u[,1]
##           [,1]      [,2]      [,3]      [,4]       [,5]       [,6]        [,7]
## [1,] -1.983668 -1.705579 -1.340216 -1.420782 -0.9423811 0.04689258 -0.09048395
##         [,8]      [,9]     [,10]     [,11]     [,12]     [,13]    [,14]
## [1,] 0.71721 0.8540019 0.8540019 0.5854404 0.8016959 0.7846503 0.858507
##         [,15]   [,16]    [,17]    [,18]    [,19]    [,20]    [,21]    [,22]
## [1,] 1.353953 1.93447 1.808307 1.989887 2.890979 2.776548 2.907215 3.140809
##         [,23]    [,24]     [,25]     [,26]     [,27]     [,28]     [,29]
## [1,] 3.362087 4.562009 -2.512689 -2.439124 -2.291411 -1.693556 -1.688242
##          [,30]     [,31]     [,32]     [,33]     [,34]     [,35]     [,36]
## [1,] -1.910913 -1.654375 -1.597856 -1.683736 -1.086174 -1.103512 -1.166447
##           [,37]      [,38]      [,39]      [,40]      [,41]     [,42]
## [1,] -0.7219127 -0.8307375 -0.7851402 -0.6374268 -0.8600987 -0.403541
##           [,43]      [,44]         [,45]       [,46]     [,47]     [,48]
## [1,] -0.4211712 -0.1937018 -0.0003910952 -0.01772898 0.1355914 0.8187415
scaledTurtles %*% turtles.svd$v[,1]
##                [,1]
##  [1,] -1.9836684602
##  [2,] -1.7055794726
##  [3,] -1.3402164350
##  [4,] -1.4207818191
##  [5,] -0.9423811150
##  [6,]  0.0468925757
##  [7,] -0.0904839545
##  [8,]  0.7172099610
##  [9,]  0.8540018654
## [10,]  0.8540018654
## [11,]  0.5854403531
## [12,]  0.8016958980
## [13,]  0.7846503260
## [14,]  0.8585070441
## [15,]  1.3539533202
## [16,]  1.9344701590
## [17,]  1.8083074735
## [18,]  1.9898872486
## [19,]  2.8909792543
## [20,]  2.7765475313
## [21,]  2.9072153955
## [22,]  3.1408088253
## [23,]  3.3620866667
## [24,]  4.5620089799
## [25,] -2.5126887623
## [26,] -2.4391243571
## [27,] -2.2914109209
## [28,] -1.6935561972
## [29,] -1.6882415877
## [30,] -1.9109134857
## [31,] -1.6543752488
## [32,] -1.5978564155
## [33,] -1.6837364090
## [34,] -1.0861739982
## [35,] -1.1035118831
## [36,] -1.1664470694
## [37,] -0.7219127043
## [38,] -0.8307375049
## [39,] -0.7851402035
## [40,] -0.6374267672
## [41,] -0.8600986652
## [42,] -0.4035410247
## [43,] -0.4211712224
## [44,] -0.1937018329
## [45,] -0.0003910952
## [46,] -0.0177289800
## [47,]  0.1355913785
## [48,]  0.8187414700

7.7 Plotting the observations in the principal plane

► Question

Looking at the athletes data, what part of the output of the svd function leads us to the first PC coefficients, also known as the PC loadings?

svda$v[,1]
## [1] -0.7071068 -0.7071068

If we rotate the (discus, weight) plane, making the purple line the horizontal \(x\)-axis, we obtain what is known as the first principal plane.

ppdf = tibble(PC1n = -svda$u[, 1] * svda$d[1],
              PC2n = svda$u[, 2] * svda$d[2])
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + xlab("PC1 ")+
    ylab("PC2") + geom_point(aes(x=PC1n,y=0),color="red") +
    geom_segment(aes(xend = PC1n, yend = 0), color = "red") +
    geom_hline(yintercept = 0, color = "purple", lwd=1.5, alpha=0.5) +
    xlim(-3.5, 2.7) + ylim(-2,2) + coord_fixed()
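The rotation used above is nothing more than multiplication by \(V\): the new coordinates \(XV\) coincide with \(UD\). A self-contained check on toy centered data (the real code uses svda computed from the athletes):

```r
set.seed(4)
xy <- scale(matrix(rnorm(60), ncol = 2), scale = FALSE)  # centered toy data
s  <- svd(xy)

scores <- xy %*% s$v   # rotate the data into the principal plane
stopifnot(isTRUE(all.equal(scores, s$u %*% diag(s$d))))
```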

segm = tibble(xmin = pmin(ppdf$PC1n, 0), xmax = pmax(ppdf$PC1n, 0), yp = seq(-1, -2, length = nrow(ppdf)), yo = ppdf$PC2n)
ggplot(ppdf, aes(x = PC1n, y = PC2n)) + geom_point() + ylab("PC2") + xlab("PC1") +
    geom_hline(yintercept=0,color="purple",lwd=1.5,alpha=0.5) +
    geom_point(aes(x=PC1n,y=0),color="red")+
    xlim(-3.5, 2.7)+ylim(-2,2)+coord_fixed() +
    geom_segment(aes(xend=PC1n,yend=0), color="red")+
    geom_segment(data=segm,aes(x=xmin,xend=xmax,y=yo,yend=yo), color="blue",alpha=0.5)

# The mean sum of squares of the red segments corresponds to the square of the second singular value
svda$d[2]^2
## [1] 6.196729
#The variance of the red points is var(ppdf$PC1n), which is larger than the number calculated in a) by design of the first PC

#We take the ratios of the standard deviations explained by the points on the vertical and horizontal axes by computing:

sd(ppdf$PC1n)/sd(ppdf$PC2n)
## [1] 3.054182
svda$d[1]/svda$d[2]
## [1] 3.054182

► Task

Use prcomp to compute the PCA of the first two columns of the athletes data, look at the output. Compare to the singular value decomposition.

prcomp(athletes[,1:2])
## Standard deviations (1, .., p=2):
## [1] 0.3452872 0.1805680
## 
## Rotation (n x k) = (2 x 2):
##             PC1       PC2
## m100  0.5541639 0.8324075
## long -0.8324075 0.5541639
svd(athletes[,1:2])
## $d
## [1] 76.269677  1.952953
## 
## $u
##             [,1]         [,2]
##  [1,] -0.1767451 -0.113423360
##  [2,] -0.1726840 -0.226609836
##  [3,] -0.1760415 -0.137000969
##  [4,] -0.1694264 -0.265162962
##  [5,] -0.1742018 -0.176703317
##  [6,] -0.1741439 -0.354214297
##  [7,] -0.1732940  0.031420171
##  [8,] -0.1711520  0.038838147
##  [9,] -0.1734554 -0.007063171
## [10,] -0.1754672 -0.054148604
## [11,] -0.1734581 -0.207350719
## [12,] -0.1753370 -0.093816061
## [13,] -0.1732155 -0.116244447
## [14,] -0.1734474 -0.159046280
## [15,] -0.1744533 -0.182588996
## [16,] -0.1725101 -0.006297023
## [17,] -0.1742767  0.238011364
## [18,] -0.1772543  0.160313422
## [19,] -0.1720072  0.005474335
## [20,] -0.1702281 -0.057004582
## [21,] -0.1792376 -0.008908758
## [22,] -0.1765106  0.129666020
## [23,] -0.1757169  0.073490749
## [24,] -0.1740573  0.098983727
## [25,] -0.1725717 -0.095836107
## [26,] -0.1734028  0.167696506
## [27,] -0.1719162  0.039639187
## [28,] -0.1766613  0.139487116
## [29,] -0.1731219  0.118207953
## [30,] -0.1771143  0.102187888
## [31,] -0.1702112  0.458637467
## [32,] -0.1721329  0.378954371
## [33,] -0.1785929  0.078262097
## 
## $v
##            [,1]       [,2]
## [1,] -0.8433808  0.5373163
## [2,] -0.5373163 -0.8433808

7.7.1 PCA of the turtles data

We can now get summary statistics for the one- and two-dimensional data. Next we answer the question about the “true” dimensionality of these rescaled data. Let's consider the scaledTurtles data.

pcaturtles = princomp(scaledTurtles)
pcaturtles
## Call:
## princomp(x = scaledTurtles)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3 
## 1.6954576 0.2048201 0.1448180 
## 
##  3  variables and  48 observations.
fviz_eig(pcaturtles, geom = "bar", bar_width = 0.4) + ggtitle("")

This scree plot shows the eigenvalues (variances) of the successive principal components.

Compare the PCA functions that have been created by different teams working in different areas:

svd(scaledTurtles)$v[, 1]
## [1] 0.5787981 0.5779840 0.5752628
prcomp(turtles[, -1])$rotation[, 1]
##    length     width    height 
## 0.8068646 0.4947448 0.3227958
princomp(scaledTurtles)$loadings[, 1]
##    length     width    height 
## 0.5787981 0.5779840 0.5752628
dudi.pca(turtles[, -1], nf = 2, scannf = FALSE)$c1[, 1]
## [1] -0.5787981 -0.5779840 -0.5752628

► Question

The scores from the princomp function (call it res) are in the scores slot of the result. Compute PC1 for the turtles directly and compare it to res$scores. Compare the standard deviation sd1 to that in the res object and to the standard deviation of the scores.

res = princomp(scaledTurtles)
PC1 = scaledTurtles %*% res$loadings[,1]
sd1 = sqrt(mean(res$scores[, 1]^2))

► Question

Check the orthogonality of the res$scores matrix. Why can't we say that it is orthonormal?

Combine both the PC scores (\(US\)) and the loadings/coefficients (\(V\)) to form a biplot (a plot where both samples and variables are represented).

fviz_pca_biplot(pcaturtles, label = "var", habillage = turtles[, 1]) +
  ggtitle("")

► Question

Compare the variance of each new coordinate to the eigenvalues returned by the PCA dudi.pca function.

pcadudit = dudi.pca(scaledTurtles, nf = 2, scannf = FALSE)
apply(pcadudit$li, 2, function(x) sum(x^2)/48)
##      Axis1      Axis2 
## 2.93573765 0.04284387
pcadudit$eig
## [1] 2.93573765 0.04284387 0.02141848

The lengths of the arrows indicate the quality of the projections onto the first principal plane:

fviz_pca_var(pcaturtles, col.circle = "black") + ggtitle("") +
  xlim(c(-1.2, 1.2)) + ylim(c(-1.2, 1.2))

► Question

Explain the relationships between the number of rows of our turtles data matrix and the following numbers:

svd(scaledTurtles)$d/pcaturtles$sdev
##   Comp.1   Comp.2   Comp.3 
## 6.928203 6.928203 6.928203
sqrt(47)
## [1] 6.855655

The constant ratio is \(\sqrt{48} \approx 6.9282\), the square root of the number of rows: princomp computes variances with divisor \(n = 48\), whereas \(\sqrt{47}\) corresponds to the unbiased divisor \(n - 1\) used by the usual sd.

7.7.2 A complete analysis: the decathlon athletes

cor(athletes) %>% round(1)
##        m100 long weight highj m400 m110 disc pole javel m1500
## m100    1.0 -0.5   -0.2  -0.1  0.6  0.6  0.0 -0.4  -0.1   0.3
## long   -0.5  1.0    0.1   0.3 -0.5 -0.5  0.0  0.3   0.2  -0.4
## weight -0.2  0.1    1.0   0.1  0.1 -0.3  0.8  0.5   0.6   0.3
## highj  -0.1  0.3    0.1   1.0 -0.1 -0.3  0.1  0.2   0.1  -0.1
## m400    0.6 -0.5    0.1  -0.1  1.0  0.5  0.1 -0.3   0.1   0.6
## m110    0.6 -0.5   -0.3  -0.3  0.5  1.0 -0.1 -0.5  -0.1   0.1
## disc    0.0  0.0    0.8   0.1  0.1 -0.1  1.0  0.3   0.4   0.4
## pole   -0.4  0.3    0.5   0.2 -0.3 -0.5  0.3  1.0   0.3   0.0
## javel  -0.1  0.2    0.6   0.1  0.1 -0.1  0.4  0.3   1.0   0.1
## m1500   0.3 -0.4    0.3  -0.1  0.6  0.1  0.4  0.0   0.1   1.0
pca.ath = dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
##  [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
##  [8] 0.3067981 0.2669494 0.1018542
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")

fviz_pca_var(pca.ath, col.circle = "black") + ggtitle("")

athletes[, c(1, 5, 6, 10)] = -athletes[, c(1, 5, 6, 10)]
cor(athletes) %>% round(1)
##        m100 long weight highj m400 m110 disc pole javel m1500
## m100    1.0  0.5    0.2   0.1  0.6  0.6  0.0  0.4   0.1   0.3
## long    0.5  1.0    0.1   0.3  0.5  0.5  0.0  0.3   0.2   0.4
## weight  0.2  0.1    1.0   0.1 -0.1  0.3  0.8  0.5   0.6  -0.3
## highj   0.1  0.3    0.1   1.0  0.1  0.3  0.1  0.2   0.1   0.1
## m400    0.6  0.5   -0.1   0.1  1.0  0.5 -0.1  0.3  -0.1   0.6
## m110    0.6  0.5    0.3   0.3  0.5  1.0  0.1  0.5   0.1   0.1
## disc    0.0  0.0    0.8   0.1 -0.1  0.1  1.0  0.3   0.4  -0.4
## pole    0.4  0.3    0.5   0.2  0.3  0.5  0.3  1.0   0.3   0.0
## javel   0.1  0.2    0.6   0.1 -0.1  0.1  0.4  0.3   1.0  -0.1
## m1500   0.3  0.4   -0.3   0.1  0.6  0.1 -0.4  0.0  -0.1   1.0
pcan.ath = dudi.pca(athletes, nf = 2, scannf = FALSE)
pcan.ath$eig
##  [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
##  [8] 0.3067981 0.2669494 0.1018542
fviz_pca_var(pcan.ath, col.circle="black") + ggtitle("")

fviz_pca_ind(pcan.ath) + ggtitle("") + ylim(c(-2.5,5.7))

data("olympic", package = "ade4")
olympic$score
##  [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
## [16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
## [31] 7231 7016 6907

7.7.3 How to choose \(k\), the number of dimensions ?

We use the screeplot of the variances (eigenvalues) of the new variables: choose \(k\) at the "elbow", the point after which the remaining eigenvalues are small and of comparable size, so that further components add little explained variance.
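As a minimal sketch of this rule (reusing the eigenvalues of pca.ath printed above; the 80% cutoff is an arbitrary illustration, not a universal threshold), we can inspect the cumulative proportion of variance:

```r
# Eigenvalues copied from pca.ath$eig above
eig = c(3.4182381, 2.6063931, 0.9432964, 0.8780212, 0.5566267,
        0.4912275, 0.4305952, 0.3067981, 0.2669494, 0.1018542)
cumvar = cumsum(eig) / sum(eig)
round(cumvar, 2)
# Smallest k capturing, say, 80% of the variance:
which(cumvar >= 0.8)[1]  # 5, though the elbow in the screeplot suggests k = 2
```

Note that the two criteria can disagree: a variance-percentage cutoff often keeps more components than the elbow does.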

7.8 PCA as an exploratory tool: using extra information

pcaMsig3 = dudi.pca(Msig3transp, center = TRUE, scale = TRUE,
                    scannf = FALSE, nf = 4)
fviz_screeplot(pcaMsig3) + ggtitle("")

ids = rownames(Msig3transp)
celltypes = factor(substr(ids, 7, 9))
status = factor(substr(ids, 1, 3))
table(celltypes)
## celltypes
## EFF MEM NAI 
##  10   9  11
cbind(pcaMsig3$li, tibble(Cluster = celltypes, sample = ids)) %>%
ggplot(aes(x = Axis1, y = Axis2)) +
  geom_point(aes(color = Cluster), size = 5) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_vline(xintercept = 0, linetype = 2) +
  scale_color_discrete(name = "Cluster") + coord_fixed()

7.8.1 Mass Spectroscopy Data Analysis

load(here("Book", "data", "mat1xcms.RData"))
dim(mat1)
## [1] 399  12
pcamat1 = dudi.pca(t(mat1), scannf = FALSE, nf = 3)
fviz_eig(pcamat1, geom = "bar", bar_width = 0.7) + ggtitle("")

dfmat1 = cbind(pcamat1$li, tibble(
    label = rownames(pcamat1$li),
    number = substr(label, 3, 4),
    type = factor(substr(label, 1, 2))))
pcsplot = ggplot(dfmat1,
  aes(x=Axis1, y=Axis2, label=label, group=number, colour=type)) +
 geom_text(size = 4, vjust = -0.5)+ geom_point(size = 3)+ylim(c(-18,19))
pcsplot + geom_hline(yintercept = 0, linetype = 2) +
  geom_vline(xintercept = 0, linetype = 2)

pcsplot + geom_line(colour = "red")

7.8.2 Biplots and scaling

Generate a biplot of a simple dataset in which chemical measurements were made on different wines, for which we also have a categorical wine.class variable.

load(here("Book", "data", "wine.RData"))
load(here("Book", "data", "wineClass.RData"))
pheatmap(1 - cor(wine), treeheight_row = 0.2)

winePCAd = dudi.pca(wine, scannf=FALSE)
table(wine.class)
## wine.class
##     barolo grignolino    barbera 
##         59         71         48
fviz_pca_biplot(winePCAd, geom = "point", habillage = wine.class,
   col.var = "violet", addEllipses = TRUE, ellipse.level = 0.69) +
   ggtitle("") + coord_fixed()

A biplot is a simultaneous representation of both the space of observations and the space of variables.
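The same idea can be sketched with base R's biplot() on prcomp output; the built-in USArrests dataset is used here purely as a stand-in, not as part of the wine analysis:

```r
# Points are the sample scores on PC1/PC2; arrows are the variable loadings
p = prcomp(USArrests, scale. = TRUE)  # standardize variables before PCA
biplot(p, cex = 0.6)
```

Variables whose arrows point in similar directions are positively correlated; samples plotted far along an arrow tend to have high values of that variable.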

7.8.3 An example of weighted PCA

When groups of observations have unequal sizes, we can weight the observations so that each group contributes equally to the analysis. Let's try this with the Hiiragi data.

library("matrixStats")  # provides rowVars()
data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]
sel = order(rowVars(Biobase::exprs(xwt)), decreasing = TRUE)[1:100]
xwt = xwt[sel, ]
tab = table(xwt$sampleGroup)
tab
## 
##      E3.25 E3.5 (EPI)  E3.5 (PE) E4.5 (EPI)  E4.5 (PE) 
##         36         11         11          4          4
xwt$weight = 1 / as.numeric(tab[xwt$sampleGroup])
pcaMouse = dudi.pca(as.data.frame(t(Biobase::exprs(xwt))),
  row.w = xwt$weight,
  center = TRUE, scale = TRUE, nf = 2, scannf = FALSE)
fviz_eig(pcaMouse) + ggtitle("")

fviz_pca_ind(pcaMouse, geom = "point", col.ind = xwt$sampleGroup) +
  ggtitle("") + coord_fixed()

Exercises

► Exercise 7.1

7.1a If a matrix \(X\) has no rows and no columns which are all zeros, then is this decomposition unique?

No. Even when \(X\) has no all-zero rows or columns, a rank-one decomposition \(X = u v^T\) is only determined up to a scalar: for any nonzero \(c\), \((cu)(v/c)^T\) yields the same matrix, so the factors \(u\) and \(v\) themselves are not unique.
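A quick numerical check of this non-uniqueness, with small made-up vectors:

```r
u = c(1, 2, 3); v = c(4, 5)
X  = u %*% t(v)            # one rank-one factorization
X2 = (2 * u) %*% t(v / 2)  # a different pair of factors...
all.equal(X, X2)           # ...giving exactly the same matrix: TRUE
```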

7.1b Generate a rank-one matrix: start by taking a vector of length 15 with values from 2 to 30 in increments of 2, and a vector of length 4 with values 3, 6, 9, 12; then take their outer product.

u = seq(2, 30, by = 2)
v = seq(3, 12, by = 3)
X1 = u %*% t(v)

For the matrix product to be defined, the number of columns of the left factor must equal the number of rows of the right factor. R treats u as a 15 × 1 column vector here, so we transpose v with t() into a 1 × 4 row vector; u %*% t(v) is then the 15 × 4 outer product, a rank-one matrix.
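A quick sanity check of the dimensions and the rank (base R only):

```r
u = seq(2, 30, by = 2)   # length 15
v = seq(3, 12, by = 3)   # length 4
X1 = u %*% t(v)          # (15 x 1) %*% (1 x 4)
dim(X1)                  # 15 4
qr(X1)$rank              # 1: the outer product has rank one
```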

7.1c Add some noise in the form a matrix we call Materr so we have an “approximately rank-one” matrix.

Materr = matrix(rnorm(60,1),nrow=15,ncol=4)
X = X1+Materr
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)

longData<-melt(X1)
longData<-longData[longData$value!=0,]

ggplot(longData, aes(x = Var2, y = Var1)) + 
  geom_raster(aes(fill=value)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Columns", y="Rows", title="Matrix") +
  theme_bw() + theme(axis.text.x=element_text(size=9, angle=0, vjust=0.3),
                     axis.text.y=element_text(size=9),
                     plot.title=element_text(size=11))

ggplot(longData, aes(x = Var2, y = Var1, col = value, fill = value, label = value)) +
  geom_tile() +
  geom_text(col = "black") +
  theme_minimal() +
  scale_fill_gradient2(low = "white", mid = "yellow", high = "red") +
  scale_color_gradient2(low = "white", mid = "yellow", high = "red")

7.1d Redo the same analyses with a rank-two matrix.

Y= matrix(c(12.5, 35.0, 25.0, 25, 9, 14, 26, 18, 16, 21, 49, 32,18, 28, 52, 36), ncol = 4, byrow = TRUE)

longData1 <- melt(Y)
longData1 <- longData1[longData1$value != 0, ]

ggplot(longData1, aes(x = Var2, y = Var1)) + 
  geom_raster(aes(fill=value)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Columns", y="Rows", title="Matrix") +
  theme_bw() + theme(axis.text.x=element_text(size=9, angle=0, vjust=0.3),
                     axis.text.y=element_text(size=9),
                     plot.title=element_text(size=11))

► Exercise 7.2 7.2a Create highly correlated bivariate data such as that shown in Figure 7.35.

# Define parameters: means, standard deviations, and the correlation
mu1 = 1; mu2 = 3.5; s1 = 3.5; s2 = 1.5; rho = 0.9

sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), ncol = 2)
bv_data = data.frame(mvrnorm(50, mu = c(mu1, mu2), Sigma = sigma))

ggplot(data.frame(bv_data),aes(x=X1,y=X2)) +
  geom_point()

Check the rank of the matrix by looking at its singular values: two clearly nonzero singular values indicate that the data matrix has rank two.

svd(scale(bv_data))$d

7.2 b Perform a PCA and show the rotated principal component axes.

bv_pca = prcomp(bv_data)
autoplot(bv_pca,loadings = TRUE,
         loadings.colour = 'orange',loadings.label = TRUE)

# I like autoplot because you do not need to worry about elongation;
# just use the percentages indicated on the axes

► Exercise 7.3 Part (A) in Figure 7.35 shows a very elongated plotting region, why?

With coord_fixed(), the length of each axis is proportional to the standard deviation of the data along the corresponding principal component, so the plotting region is elongated along the first component, which explains most of the variance.
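As a small sketch (again using the built-in USArrests data as an arbitrary stand-in), the ratio of the component standard deviations is exactly the aspect ratio an honest, coord_fixed-style plot should have:

```r
p = prcomp(USArrests, scale. = TRUE)
# How many times longer the PC1 axis should be drawn than the PC2 axis:
p$sdev[1] / p$sdev[2]
```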

What happens if you do not use the coord_fixed() option and have a square plotting zone? Why can this be misleading?

mu1 = 1; mu2 = 2; s1=2.5; s2=0.8; rho=0.9;
sigma = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2)
sim2d = data.frame(mvrnorm(50, mu = c(mu1,mu2), Sigma = sigma))
svd(scale(sim2d))$d
## [1] 9.652671 2.196803
svd(scale(sim2d))$v[,1]
## [1] -0.7071068 -0.7071068
ggplot(data.frame(sim2d),aes(x=X1,y=X2)) +
    geom_point()

respc=princomp(sim2d)
dfpc = data.frame(pc1=respc$scores[,1],
pc2 = respc$scores[,2])
 ggplot(dfpc,aes(x=pc1,y=pc2)) +
   geom_point() #+ coord_fixed(2)

Without coord_fixed(), a square plotting region visually stretches the second component relative to the first, so it becomes difficult to judge how much of the variance each component actually explains.

► Exercise 7.4 7.4a Make a correlation circle for the unweighted Hiiragi data xwt. Which genes have the best projections on the first principal plane (best approximation)?

data("x", package = "Hiiragi2013")
xwt = x[, x$genotype == "WT"]

xwt_pca = prcomp(as.data.frame(t(Biobase::exprs(xwt))))
# use a screeplot to view the proportion of variance explained by the various components
screeplot(xwt_pca)

fviz_pca_var(xwt_pca) + ggtitle("") + coord_fixed()

#compare this with the autoplot
autoplot(xwt_pca,loadings = TRUE)

Which genes have the best projections on the first principal plane (best approximation)?

# Genes whose arrows come closest to the unit circle in the correlation
# circle, i.e., with the largest loadings on PC1 and PC2, are the ones best
# approximated in the first principal plane
head(xwt_pca$rotation[, 1:2])
##                       PC1          PC2
## 1415670_at    0.014979074  0.002785063
## 1415671_at    0.003735398  0.008537833
## 1415672_at   -0.004104042  0.011274405
## 1415673_at    0.008555240 -0.004056765
## 1415674_a_at  0.006632186  0.006343311
## 1415675_at    0.003275449  0.001912670

7.4b Make a biplot showing the labels of the extreme gene-variables that explain most of the variance in the first plane. Add the sample points.

fviz_pca_biplot(xwt_pca, geom.ind = "point",
   select.var = list(contrib = 10), repel = TRUE) +
   ggtitle("") + coord_fixed()